Understanding Diffusion Models: A Unified Perspective

This is a learning note for this series of videos.

Link to the paper: https://arxiv.org/abs/2208.11970

1. Introduction

Given observed samples $\bold{x}$ from a distribution of interest, the goal of a generative model is to learn to model its true data distribution $p(\bold{x})$. This learned model can be used to generate new samples or evaluate the likelihood of observed/sampled data.

We usually think of the data we observe as represented or generated by an associated unseen latent variable, which we denote by the random variable $\bold{z}$.

2. Understanding the Evidence Lower Bound (ELBO)

We can imagine the latent variables and the data we observe as modeled by a joint distribution $p(\bold{x}, \bold{z})$. Likelihood-based modeling learns a model that maximizes the likelihood $p(\bold{x})$ of all observed $\bold{x}$. There are two ways to recover $p(\bold{x})$ (both are checked numerically in the short sketch after this list):

  1. Explicitly marginalize out the latent variable $\bold{z}$:
    $$\begin{equation} p(\bold{x}) = \int p(\bold{x}, \bold{z}) \, d\bold{z} \end{equation}$$

    This is difficult because integrating out all latent variables $\bold{z}$ is intractable for complex models.

  2. Apply the chain rule of probability:
    $$\begin{equation} p(\bold{x}) = \frac{p(\bold{x}, \bold{z})}{p(\bold{z} | \bold{x})} \end{equation}$$

    This is difficult because we have no access to a ground-truth latent encoder $p(\bold{z} | \bold{x})$.
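To make the two routes concrete, here is a minimal NumPy sketch (a toy discrete joint distribution; all variable names are illustrative) that recovers $p(\bold{x})$ both ways:

```python
import numpy as np

# Toy joint distribution p(x, z) over 3 values of x and 4 values of z.
# Rows index x, columns index z; all entries sum to 1.
rng = np.random.default_rng(0)
p_xz = rng.random((3, 4))
p_xz /= p_xz.sum()

# Route 1: marginalize out the latent z (a sum here, an integral in general).
p_x_marginal = p_xz.sum(axis=1)

# Route 2: chain rule, p(x) = p(x, z) / p(z | x), for any fixed z.
p_z_given_x = p_xz / p_xz.sum(axis=1, keepdims=True)  # true posterior p(z | x)
p_x_chain = p_xz[:, 0] / p_z_given_x[:, 0]            # pick z = 0 arbitrarily

assert np.allclose(p_x_marginal, p_x_chain)           # both routes agree
```

With the full joint table in hand, both routes are trivial; the difficulty described above arises because, for complex models, we can neither evaluate the integral nor access the true posterior $p(\bold{z} | \bold{x})$.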

We call the log likelihood $\log p(\bold{x})$ the “evidence”. From it we can derive a term called the Evidence Lower Bound (ELBO), which is a lower bound on the evidence. Maximizing the ELBO becomes a proxy objective with which to optimize a latent variable model.

Formally, the equation of the ELBO is:

$$\begin{equation} \mathbb{E}_{q_{\phi}(\bold{z} | \bold{x})} \left[ \log \frac{p(\bold{x}, \bold{z})}{q_{\phi}(\bold{z} | \bold{x})} \right] \end{equation}$$

Its relationship with the evidence (log likelihood) is written as:

$$\begin{equation} \log p(\bold{x}) \geq \mathbb{E}_{q_{\phi}(\bold{z} | \bold{x})} \left[ \log \frac{p(\bold{x}, \bold{z})}{q_{\phi}(\bold{z} | \bold{x})} \right] \end{equation}$$

Here $q_{\phi}(\bold{z} | \bold{x})$ is a flexible approximate variational distribution with parameters $\phi$ that we seek to optimize. It seeks to approximate the true posterior $p(\bold{z} | \bold{x})$.

How do we derive the ELBO? And why is the ELBO an objective we would like to maximize?

To understand the relationship between the evidence and the ELBO, here is a derivation that makes the gap between them explicit:

$$\begin{align} \log p(\bold{x}) &= \log p(\bold{x}) \int q_{\phi}(\bold{z} | \bold{x}) \, d\bold{z} && \text{(Multiply by 1)} \\ &= \int q_{\phi}(\bold{z} | \bold{x}) \log p(\bold{x}) \, d\bold{z} \\ &= \mathbb{E}_{q_{\phi}(\bold{z} | \bold{x})}[\log p(\bold{x})] && \text{(Definition of Expectation)} \\ &= \mathbb{E}_{q_{\phi}(\bold{z} | \bold{x})} \left[ \log \frac{p(\bold{x}, \bold{z})}{p(\bold{z} | \bold{x})} \right] && \text{(Apply Bayes' Rule)} \\ &= \mathbb{E}_{q_{\phi}(\bold{z} | \bold{x})} \left[ \log \frac{p(\bold{x}, \bold{z}) q_{\phi}(\bold{z} | \bold{x})}{p(\bold{z} | \bold{x}) q_{\phi}(\bold{z} | \bold{x})} \right] && \text{(Multiply by 1)} \\ &= \mathbb{E}_{q_{\phi}(\bold{z} | \bold{x})} \left[ \log \frac{p(\bold{x}, \bold{z})}{q_{\phi}(\bold{z} | \bold{x})} \right] + \mathbb{E}_{q_{\phi}(\bold{z} | \bold{x})} \left[ \log \frac{q_{\phi}(\bold{z} | \bold{x})}{p(\bold{z} | \bold{x})} \right] && \text{(Split the Expectation)} \\ &= \mathbb{E}_{q_{\phi}(\bold{z} | \bold{x})} \left[ \log \frac{p(\bold{x}, \bold{z})}{q_{\phi}(\bold{z} | \bold{x})} \right] + D_{\text{KL}}(q_{\phi}(\bold{z} | \bold{x}) \| p(\bold{z} | \bold{x})) && \text{(Definition of KL Divergence)} \\ &\geq \mathbb{E}_{q_{\phi}(\bold{z} | \bold{x})} \left[ \log \frac{p(\bold{x}, \bold{z})}{q_{\phi}(\bold{z} | \bold{x})} \right] && \text{(KL Divergence always} \geq 0) \end{align}$$

From the second-to-last line of this derivation, we observe that the evidence equals the ELBO plus the KL divergence between the approximate posterior $q_{\phi}(\bold{z} | \bold{x})$ and the true posterior $p(\bold{z} | \bold{x})$. Note that the left-hand side is the evidence, which is a constant with respect to the variational parameters $\phi$. Since the ELBO and the KL divergence always sum to this constant, any maximization of the ELBO with respect to $\phi$ necessarily invokes an equal minimization of the KL divergence. Thus, the ELBO can be maximized as a proxy for learning how to perfectly model the true latent posterior distribution.

Alternatively, the ELBO can be derived more directly by applying Jensen's inequality:

$$\begin{align} \log p(\bold{x}) &= \log \int p(\bold{x}, \bold{z}) \, d\bold{z} \\ &= \log \int \frac{p(\bold{x}, \bold{z}) q_{\phi}(\bold{z} | \bold{x})}{q_{\phi}(\bold{z} | \bold{x})} \, d\bold{z} && \text{(Multiply by 1)} \\ &= \log \mathbb{E}_{q_{\phi}(\bold{z} | \bold{x})} \left[ \frac{p(\bold{x}, \bold{z})}{q_{\phi}(\bold{z} | \bold{x})} \right] && \text{(Definition of Expectation)} \\ &\geq \mathbb{E}_{q_{\phi}(\bold{z} | \bold{x})} \left[ \log \frac{p(\bold{x}, \bold{z})}{q_{\phi}(\bold{z} | \bold{x})} \right] && \text{(Apply Jensen's Inequality)} \end{align}$$

This shorter derivation, however, tells us nothing about how loose the bound is; the KL term in the first derivation quantifies exactly that gap.
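The decomposition is easy to check numerically. Here is a minimal NumPy sketch (again a toy discrete model with illustrative names) confirming that the ELBO plus the KL divergence equals the evidence exactly, for an arbitrary, unoptimized $q_{\phi}(\bold{z} | \bold{x})$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Joint slice p(x0, z) for a single observation x0 over 4 latent values.
p_xz = rng.random(4) / 5.0      # unnormalized over z; p(x0) = p_xz.sum() < 1
p_x = p_xz.sum()                # evidence for this observation
true_post = p_xz / p_x          # true posterior p(z | x0)

# An arbitrary (not optimized) approximate posterior q(z | x0).
q = rng.random(4)
q /= q.sum()

elbo = np.sum(q * np.log(p_xz / q))        # E_q[log p(x, z) / q(z | x)]
kl = np.sum(q * np.log(q / true_post))     # D_KL(q || p(z | x))

assert np.isclose(np.log(p_x), elbo + kl)  # evidence = ELBO + KL
assert elbo <= np.log(p_x)                 # the bound holds since KL >= 0
```

Setting `q` equal to `true_post` drives the KL term to zero and makes the bound tight.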

3. Variational AutoEncoder (VAE)

In the default VAE, we directly maximize the ELBO.

It is ‘variational’ because we optimize for the best $q_{\phi}(\bold{z} | \bold{x})$ among a family of potential posterior distributions parameterized by $\phi$.

It is an ‘autoencoder’ because the input data is trained to predict itself after undergoing an intermediate bottlenecking representation step.

Now, we dissect the ELBO term further:

$$\begin{align} \mathbb{E}_{q_{\phi}(\bold{z}|\bold{x})} \left[ \log \frac{p(\bold{x}, \bold{z})}{q_{\phi}(\bold{z}|\bold{x})} \right] &= \mathbb{E}_{q_{\phi}(\bold{z}|\bold{x})} \left[ \log \frac{p_{\theta}(\bold{x}|\bold{z}) p(\bold{z})}{q_{\phi}(\bold{z}|\bold{x})} \right] && \text{(Chain Rule of Probability)} \\ &= \mathbb{E}_{q_{\phi}(\bold{z}|\bold{x})} \left[ \log p_{\theta}(\bold{x} | \bold{z}) \right] + \mathbb{E}_{q_{\phi}(\bold{z}|\bold{x})} \left[ \log \frac{p(\bold{z})}{q_{\phi}(\bold{z}|\bold{x})} \right] && \text{(Split the Expectation)} \\ &= \underbrace{\mathbb{E}_{q_{\phi}(\bold{z}|\bold{x})} \left[ \log p_{\theta}(\bold{x} | \bold{z}) \right]}_{\text{reconstruction term}} - \underbrace{D_{\text{KL}}(q_{\phi}(\bold{z} | \bold{x}) \| p(\bold{z}))}_{\text{prior matching term}} && \text{(Definition of KL Divergence)} \end{align}$$

We learn an intermediate bottlenecking distribution $q_{\phi}(\bold{z} | \bold{x})$ that can be treated as an encoder. We also learn a deterministic function $p_{\theta}(\bold{x} | \bold{z})$ to convert a given latent vector $\bold{z}$ into an observation $\bold{x}$, which can be interpreted as a decoder.

The first term measures the reconstruction likelihood of the decoder from our variational distribution. The second term measures how similar the learned variational distribution is to a prior belief held over latent variables. Maximizing the ELBO thus amounts to maximizing the first term while minimizing the second.

The encoder of the VAE is commonly chosen to model a multivariate Gaussian with diagonal covariance, and the prior is often selected to be a standard multivariate Gaussian:

$$\begin{align} q_{\phi}(\bold{z} | \bold{x}) &= \mathcal{N}(\bold{z}; \mu_{\phi}(\bold{x}), \sigma_{\phi}^2(\bold{x}) \bold{I}) \\ p(\bold{z}) &= \mathcal{N}(\bold{z}; \bold{0}, \bold{I}) \end{align}$$

With these choices, the KL divergence term of the ELBO can be computed analytically; in closed form it is

$$D_{\text{KL}}(q_{\phi}(\bold{z} | \bold{x}) \| p(\bold{z})) = \frac{1}{2} \sum_{j} \left( \sigma_{\phi, j}^2(\bold{x}) + \mu_{\phi, j}^2(\bold{x}) - 1 - \log \sigma_{\phi, j}^2(\bold{x}) \right)$$

while the reconstruction term can be approximated using a Monte Carlo estimate. The objective can then be rewritten as:

$$\begin{equation} \argmax_{\phi, \theta} \sum_{l=1}^L \log p_{\theta}(\bold{x} | \bold{z}^{(l)}) - D_{\text{KL}}(q_{\phi}(\bold{z} | \bold{x}) \| p(\bold{z})) \end{equation}$$

where the latents $\{ \bold{z}^{(l)} \}_{l=1}^L$ are sampled from $q_{\phi}(\bold{z} | \bold{x})$ for every observation $\bold{x}$ in the dataset. A problem with this method is that each $\bold{z}^{(l)}$ the loss is computed on is generated by sampling from the multivariate Gaussian $\mathcal{N}(\bold{z}; \mu_{\phi}(\bold{x}), \sigma_{\phi}^2(\bold{x}) \bold{I})$, a stochastic operation that is non-differentiable. This can be addressed by the reparameterization trick:

$$\bold{z} = \mu_{\phi}(\bold{x}) + \sigma_{\phi}(\bold{x}) \odot \epsilon \quad \text{with } \epsilon \sim \mathcal{N}(\bold{0}, \bold{I})$$

The reparameterization trick disentangles the model parameters $\phi$, on which we need to perform gradient descent, from the non-differentiable sampling process. In this case, we only need to sample from a standard Gaussian, which involves no trainable parameters.
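Putting these pieces together, here is a minimal PyTorch sketch of one VAE training step. The network sizes, the single-sample ($L = 1$) Monte Carlo estimate, and the unit-variance Gaussian decoder (whose log-likelihood reduces to a mean-squared error up to constants) are all illustrative assumptions, not choices prescribed by this note:

```python
import torch
import torch.nn as nn

x_dim, z_dim = 784, 16
encoder = nn.Sequential(nn.Linear(x_dim, 256), nn.ReLU(), nn.Linear(256, 2 * z_dim))
decoder = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(), nn.Linear(256, x_dim))
opt = torch.optim.Adam([*encoder.parameters(), *decoder.parameters()], lr=1e-3)

x = torch.randn(32, x_dim)  # stand-in batch of observations

# Encoder outputs the parameters of q_phi(z | x) = N(mu, diag(sigma^2)).
mu, log_var = encoder(x).chunk(2, dim=-1)

# Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I).
# Gradients flow through mu and sigma; only eps is sampled.
eps = torch.randn_like(mu)
z = mu + torch.exp(0.5 * log_var) * eps

# Reconstruction term (Monte Carlo, L = 1): log p_theta(x | z) under a
# unit-variance Gaussian decoder is -0.5 * squared error, up to constants.
recon = -0.5 * ((decoder(z) - x) ** 2).sum(dim=-1)

# Prior matching term: closed-form KL(N(mu, sigma^2 I) || N(0, I)).
kl = 0.5 * (log_var.exp() + mu ** 2 - 1 - log_var).sum(dim=-1)

loss = -(recon - kl).mean()  # maximize the ELBO = minimize its negative
opt.zero_grad()
loss.backward()
opt.step()
```

Predicting `log_var` rather than `sigma` directly is a common numerical convenience: it keeps the variance positive without an explicit constraint.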

4. Hierarchical Variational Autoencoders (HVAE)

The general HVAE has $T$ hierarchical levels, where each latent is allowed to condition on all previous latents. Here we focus on a special case, the Markovian HVAE (MHVAE), in which the generative process is a Markov chain: each latent is generated only from the latent directly above it. Mathematically, we represent the joint distribution and the posterior of a Markovian HVAE as:

$$\begin{align} p(\bold{x}, \bold{z}_{1:T}) &= p(\bold{z}_T) p_{\theta}(\bold{x} | \bold{z}_1) \prod_{t=2}^T p_{\theta}(\bold{z}_{t-1} | \bold{z}_t) \\ q_{\phi}(\bold{z}_{1:T} | \bold{x}) &= q_{\phi}(\bold{z}_1 | \bold{x}) \prod_{t=2}^T q_{\phi}(\bold{z}_t | \bold{z}_{t-1}) \end{align}$$

Then the ELBO can be extended to:

$$\begin{align} \log p(\bold{x}) &= \log \int p(\bold{x}, \bold{z}_{1:T}) \, d\bold{z}_{1:T} \\ &= \log \int \frac{p(\bold{x}, \bold{z}_{1:T}) q_{\phi}(\bold{z}_{1:T} | \bold{x})}{q_{\phi}(\bold{z}_{1:T} | \bold{x})} \, d\bold{z}_{1:T} && \text{(Multiply by 1)} \\ &= \log \mathbb{E}_{q_{\phi}(\bold{z}_{1:T} | \bold{x})} \left[ \frac{p(\bold{x}, \bold{z}_{1:T})}{q_{\phi}(\bold{z}_{1:T} | \bold{x})} \right] && \text{(Definition of Expectation)} \\ &\geq \mathbb{E}_{q_{\phi}(\bold{z}_{1:T} | \bold{x})} \left[ \log \frac{p(\bold{x}, \bold{z}_{1:T})}{q_{\phi}(\bold{z}_{1:T} | \bold{x})} \right] && \text{(Apply Jensen's Inequality)} \end{align}$$

We can plug the joint distribution and the posterior defined above into this extended ELBO to get:

$$\begin{equation} \mathbb{E}_{q_{\phi}(\bold{z}_{1:T} | \bold{x})} \left[ \log \frac{p(\bold{x}, \bold{z}_{1:T})}{q_{\phi}(\bold{z}_{1:T} | \bold{x})} \right] = \mathbb{E}_{q_{\phi}(\bold{z}_{1:T} | \bold{x})} \left[ \log \frac{p(\bold{z}_T) p_{\theta}(\bold{x} | \bold{z}_1) \prod_{t=2}^T p_{\theta}(\bold{z}_{t-1} | \bold{z}_t)}{q_{\phi}(\bold{z}_1 | \bold{x}) \prod_{t=2}^T q_{\phi}(\bold{z}_t | \bold{z}_{t-1})} \right] \end{equation}$$

This equation can be further decomposed into interpretable components when we investigate Variational Diffusion Models.
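For intuition, generating from a trained MHVAE is plain ancestral sampling along the generative Markov chain above. Here is a minimal PyTorch sketch (the per-step Gaussian parameterization and the `steps` / `out_head` networks, untrained here, are hypothetical illustrations, reusing the reparameterized sampling from Section 3):

```python
import torch
import torch.nn as nn

T, z_dim, x_dim = 5, 16, 784

# Hypothetical networks: one per transition p_theta(z_{t-1} | z_t), each
# mapping the current latent to (mu, log_var) of the next one down the chain,
# plus an output head for the mean of p_theta(x | z_1).
steps = nn.ModuleList([nn.Linear(z_dim, 2 * z_dim) for _ in range(T - 1)])
out_head = nn.Linear(z_dim, x_dim)

def sample_gaussian(params: torch.Tensor) -> torch.Tensor:
    """Reparameterized draw from N(mu, diag(sigma^2))."""
    mu, log_var = params.chunk(2, dim=-1)
    return mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)

with torch.no_grad():
    z = torch.randn(1, z_dim)        # z_T ~ p(z_T) = N(0, I)
    for step in steps:               # z_T -> z_{T-1} -> ... -> z_1
        z = sample_gaussian(step(z))
    x = out_head(z)                  # mean of p_theta(x | z_1)
```

Because each step conditions only on the previous latent, sampling is a single sweep down the chain; this Markov structure is exactly what Variational Diffusion Models inherit.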